Content
- The grammar of graphics
- The major components of layers
- Hands-on practice
- Visualizations based on the gg approach
The grammar of graphics specifies grammatical rules for creating perceivable graphs, or what we call graphics (Leland Wilkinson, 2005).
Take the analogy: good grammar is just the first step in creating a good sentence.
The grammar comprises:

- DATA: a set of data operations that create variables from datasets
- TRANS: variable transformations (e.g., rank)
- SCALE: scale transformations (e.g., log)
- COORD: a coordinate system (e.g., polar)
- ELEMENT: graphs (e.g., points) and their aesthetic attributes (e.g., color)
- GUIDE: one or more guides (axes, legends, etc.)

Four ideas underpin these components:

- Algebra: the operations that allow us to combine variables and specify the dimensions of graphs
- Scales: the representation of variables on measured dimensions
- Statistics: the functions that allow graphs to change their appearance and representation schemes
- Geometry: the creation of geometric graphs from variables
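These stages can be illustrated as a tiny data-to-graphic pipeline. The function names below are purely illustrative and belong to no library:

```python
# Schematic grammar-of-graphics pipeline: DATA -> TRANS -> SCALE.
# All names here are illustrative sketches, not part of any plotting library.
import math

def trans_rank(values):
    """TRANS stage: replace each value by its rank (1 = smallest)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

def scale_log10(values):
    """SCALE stage: log10 transformation of positive values."""
    return [math.log10(v) for v in values]

data = [100, 1, 10]            # DATA stage: a raw variable
print(trans_rank(data))        # TRANS: [3, 1, 2]
print(scale_log10(data))       # SCALE: [2.0, 0.0, 1.0]
```

Later stages (COORD, ELEMENT, GUIDE) would then map these scaled values onto screen positions, marks, and axes.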
What is a Scale?
A scale defines how data values are translated into visual properties (aesthetics).
Each aesthetic — color, size, shape, position, etc. — has a corresponding scale.
You can customize scales to control:
- Color palettes
- Axis limits and breaks
- Legend appearance
- Transformations (e.g., log, sqrt)
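Under the hood, a scale is just a function from the data range to an aesthetic range. A minimal linear position scale, sketched in plain Python (illustrative only, not ggplot2's implementation):

```python
def linear_scale(domain, range_):
    """Map data values in `domain` linearly onto the aesthetic `range_`
    (e.g., pixel positions or point sizes)."""
    d0, d1 = domain
    r0, r1 = range_
    def scale(x):
        return r0 + (x - d0) / (d1 - d0) * (r1 - r0)
    return scale

# Map temperatures in [20, 100] onto a 0-400 pixel axis
pos = linear_scale((20, 100), (0, 400))
print(pos(20), pos(60), pos(100))  # 0.0 200.0 400.0
```

A log scale would simply compose a log transform with this linear mapping, which is exactly what `scale_*_log10()` variants do conceptually.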
Scales map:

- continuous data to size and color
- discrete data to shape and color

Graphical primitives

- geom_path()
- geom_rect()
- geom_polygon()

One variable

- geom_bar()
- geom_histogram()
- geom_density()

Two variables

- geom_smooth()
- geom_point()
- geom_count()
- geom_jitter()
- geom_boxplot()
- geom_violin()

Three variables

- geom_contour()
- geom_tile()
- geom_raster()

| Daily temperature data | ||||||
| station_id | month | day | temperature | flag | date | location |
|---|---|---|---|---|---|---|
| USC00042319 | 01 | 1 | 51.0 | S | 0000-01-01 | Death Valley |
| USC00042319 | 01 | 2 | 51.2 | S | 0000-01-02 | Death Valley |
| USC00042319 | 01 | 3 | 51.3 | S | 0000-01-03 | Death Valley |
| USC00042319 | 01 | 4 | 51.4 | S | 0000-01-04 | Death Valley |
| USC00042319 | 01 | 5 | 51.6 | S | 0000-01-05 | Death Valley |
| USC00042319 | 01 | 6 | 51.7 | S | 0000-01-06 | Death Valley |
p <- ggplot(temps_long,
aes(x = date,
y = temperature,
color = location)
) +
geom_line(linewidth = 1) +
scale_x_date(name = "month",
limits = c(ymd("0000-01-01"), ymd("0001-01-04")),
breaks = c(ymd("0000-01-01"), ymd("0000-04-01"), ymd("0000-07-01"),
ymd("0000-10-01"), ymd("0001-01-01")),
labels = c("Jan", "Apr", "Jul", "Oct", "Jan"), expand = c(1/366, 0)) +
scale_y_continuous(limits = c(19.9, 107),
breaks = seq(20, 100, by = 20),
name = "temperature (°F)") +
scale_color_OkabeIto(order = c(1:3, 7), name = NULL) +
theme_dviz_grid() +
theme(legend.title.align = 0.5)

The same plot in Python (matplotlib + seaborn):

# Create plot
fig, ax = plt.subplots(figsize=(9, 5))
# Use seaborn lineplot; pass palette by mapping
sns.lineplot(
data=lf,
x='date',
y='temperature',
hue='location',
palette=palette_map,
linewidth=1.5, # similar to geom_line linewidth
ax=ax
)
# X-axis limits and breaks (use valid years 2000-01-01 to 2001-01-04)
xmin = pd.to_datetime("2000-01-01")

The axis-limit calls echoed (np.float64(10957.0), np.float64(11326.0)) for x (matplotlib date numbers for 2000-01-01 to 2001-01-04) and (19.9, 107.0) for y.
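Reconstructing the truncated step: with plain matplotlib, those limits would be applied roughly as follows (the `xmax` value and the Agg backend are assumptions; the y limits mirror the ggplot code):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed for this sketch
import matplotlib.pyplot as plt
import pandas as pd

fig, ax = plt.subplots(figsize=(9, 5))
xmin = pd.to_datetime("2000-01-01")
xmax = pd.to_datetime("2001-01-04")  # assumed, mirroring the ggplot x limits
ax.set_xlim(xmin, xmax)              # echoed back as matplotlib date numbers
ax.set_ylim(19.9, 107.0)             # mirrors scale_y_continuous(limits = c(19.9, 107))
print(ax.get_ylim())
```

The tuples echoed in the notebook output above are simply the return values of these `set_xlim`/`set_ylim` calls.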
Heatmap

Preprocessing:

- group by location & month
- replace month numbers with month names

| Mean temperature per month | ||
| location | month | mean |
|---|---|---|
| Death Valley | Jan | 53.45161 |
| Death Valley | Feb | 59.94483 |
| Death Valley | Mar | 68.44839 |
| Death Valley | Apr | 76.29333 |
| Death Valley | May | 86.60645 |
| Death Valley | Jun | 95.54667 |
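A table like this is produced by a plain group-by aggregation. A minimal pandas sketch on invented data (the numbers below are illustrative, not the real station values):

```python
import pandas as pd

# Synthetic daily temperatures (values are illustrative only)
df = pd.DataFrame({
    "location": ["Death Valley"] * 4 + ["Houston"] * 2,
    "month":    ["Jan", "Jan", "Feb", "Feb", "Jan", "Jan"],
    "temperature": [53.0, 54.0, 59.0, 61.0, 52.0, 54.0],
})

# Mean temperature per location and month
means = (df.groupby(["location", "month"], sort=False)["temperature"]
           .mean()
           .reset_index(name="mean"))
print(means)
```

The same pattern scales directly to the full daily dataset used in the plot above.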
Statistical transformations

| ggplot2 stat_ functions (adapted from Hadley Wickham, 2016) | |
| Name | Description |
|---|---|
| bin | Divide continuous range into bins, and count number of points in each |
| boxplot | Compute statistics necessary for boxplot |
| contour | Calculate contour lines |
| density | Compute 1d density estimate |
| identity | Identity transformation, f(x) = x |
| jitter | Jitter values by adding small random value |
| qq | Calculate values for quantile-quantile plot |
| quantile | Quantile regression |
| smooth | Smoothed conditional mean of y given x |
| summary | Aggregate values of y for given x |
| unique | Remove duplicated observations |
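The bin stat, for example, just divides a continuous range into intervals and counts the points in each; numpy's histogram performs the same computation:

```python
import numpy as np

values = [1.0, 1.5, 2.2, 2.8, 3.9, 4.1, 4.5]

# stat_bin analogue: 4 equal-width bins over [1, 5], counting points per bin
counts, edges = np.histogram(values, bins=4, range=(1, 5))
print(counts)  # [2 2 1 2]
print(edges)   # [1. 2. 3. 4. 5.]
```

This binned count is exactly what `geom_histogram()` draws after its default `stat = "bin"` step.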
Plot the relationship between body mass and head length for blue jays.
| Blue jay dataset | ||||||||
| BirdID | KnownSex | BillDepth | BillWidth | BillLength | Head | Mass | Skull | Sex |
|---|---|---|---|---|---|---|---|---|
| 0000-00000 | M | 8.26 | 9.21 | 25.92 | 56.58 | 73.30 | 30.66 | 1 |
| 1142-05901 | M | 8.54 | 8.76 | 24.99 | 56.36 | 75.10 | 31.38 | 1 |
| 1142-05905 | M | 8.39 | 8.78 | 26.07 | 57.32 | 70.25 | 31.25 | 1 |
| 1142-05907 | F | 7.78 | 9.30 | 23.48 | 53.77 | 65.50 | 30.29 | 0 |
| 1142-05909 | M | 8.71 | 9.84 | 25.47 | 57.32 | 74.90 | 31.85 | 1 |
| 1142-05911 | F | 7.28 | 9.30 | 22.25 | 52.25 | 63.90 | 30.00 | 0 |
blue_jays_base <- ggplot(blue_jays, aes(Mass, Head)) +
scale_x_continuous(limits = c(57, 82), expand = c(0, 0), name = "body mass (g)") +
scale_y_continuous(limits = c(49, 61), expand = c(0, 0), name = "head length (mm)" ) +
theme_dviz_grid()
blue_jays_base +
stat_density_2d(color = "black", linewidth = 0.4, binwidth = 0.004) +
geom_point(color = "black", size = 1.5, alpha = 1/3)

Common applications:
- Heatmaps: aggregate values into grid cells to display intensity across two dimensions

Prompt: Given a pandas DataFrame with more than 200 million rows and an 'mz' column with more than 26 million unique values, how can the table be aggregated so that we can create a heat map with mz on the vertical axis, time on the horizontal axis, and intensity on the 'z' axis (color)?
| Id | Time | scanid | index | intensity | mz |
|---|---|---|---|---|---|
| 1 | 0.312346 | 35 | 376857 | 9 | 1548.487069 |
| 1 | 0.312346 | 38 | 2796 | 9 | 99.330436 |
| 1 | 0.312346 | 38 | 274380 | 9 | 979.736426 |
| 1 | 0.312346 | 38 | 313091 | 9 | 1179.351225 |
| 1 | 0.312346 | 39 | 322655 | 9 | 1231.517479 |
| … | … | … | … | … | … |
| 6481 | 691.492002 | 916 | 55969 | 86 | 200.037732 |
| 6481 | 691.492002 | 916 | 74584 | 138 | 243.540075 |
| 6481 | 691.492002 | 916 | 92301 | 62 | 288.915998 |
| 6481 | 691.492002 | 916 | 218889 | 53 | 725.852351 |
| 6481 | 691.492002 | 917 | 215210 | 99 | 710.363398 |
To aggregate such a large Pandas DataFrame for creating a heatmap, you can use binning and grouping techniques to reduce the data into manageable chunks. Here’s a general approach to achieve this:
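Before running this at full scale, the binning-and-pivot idea can be checked on a tiny synthetic frame (column names follow the table above; the bin edges are arbitrary):

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the huge mass-spec table
df = pd.DataFrame({
    "Time":      [0.1, 0.1, 0.9, 0.9],
    "mz":        [100.0, 900.0, 100.0, 900.0],
    "intensity": [10, 20, 30, 40],
})

mz_bins   = np.linspace(0, 1000, 3)  # 2 mz bins: (0, 500], (500, 1000]
time_bins = np.linspace(0, 1, 3)     # 2 time bins: (0, 0.5], (0.5, 1]

# Each cell is the mean intensity of the points falling in that (mz, time) bin
heatmap = df.pivot_table(index=pd.cut(df["mz"], bins=mz_bins),
                         columns=pd.cut(df["Time"], bins=time_bins),
                         values="intensity",
                         aggfunc="mean")
print(heatmap)
```

On the real data, the same call with a few hundred to a few thousand bins per axis reduces 200 million rows to a grid small enough to plot.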
# Step 1: Define bins to downsample mz and Time and avoid memory issues
# (bin counts are a tuning choice; fewer bins = coarser heatmap, less memory)
mz_bins = np.linspace(super_df['mz'].min(), super_df['mz'].max(), 1000)
time_bins = np.linspace(super_df['Time'].min(), super_df['Time'].max(), 1000)
# Step 2: Create a pivot table (heatmap structure) by aggregating intensity
heatmap_data = super_df.pivot_table(index=pd.cut(super_df['mz'], bins=mz_bins),
                                    columns=pd.cut(super_df['Time'], bins=time_bins),
                                    values='intensity',
                                    aggfunc='mean')  # change 'mean' to 'sum' if appropriate
# Step 3: Apply a logarithmic transformation to highlight low-intensity values
heatmap_data_log = np.log1p(heatmap_data)
# Step 4: Plot the heatmap
plt.figure(figsize=(17, 10))  # adjust figure size as needed
sns.heatmap(heatmap_data_log, cmap='magma', cbar_kws={'label': 'Log(Intensity)'})
# 'magma' puts more emphasis on high values
# Step 5: Show the plot
plt.show()

| Position Adjustments in ggplot2 | ||
| Position | Description | Commonly Used With |
|---|---|---|
| identity | No adjustment — geoms are placed exactly where data specifies. | geom_point(), geom_bar() |
| stack | Stacks elements vertically along the y-axis. | geom_bar(), geom_area() |
| fill | Like 'stack', but scales bars to show proportions (fills to 100%). | geom_bar(), geom_area() |
| dodge | Places overlapping objects side-by-side for comparison. | geom_bar(), geom_boxplot() |
| jitter | Adds small random variation to reduce overplotting. | geom_point() |
| nudge | Moves text or labels slightly to improve readability. | geom_text(), geom_label() |
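As an illustration of what dodge computes, the side-by-side offsets can be derived by hand (a sketch of the idea, not ggplot2's actual code):

```python
def dodge_positions(x_centers, n_groups, width=0.8):
    """For each categorical slot center, compute side-by-side bar centers
    for n_groups bars sharing total `width` (mimics position_dodge)."""
    bar_w = width / n_groups
    out = []
    for x in x_centers:
        start = x - width / 2 + bar_w / 2
        out.append([start + g * bar_w for g in range(n_groups)])
    return out

# Two groups at categorical slots 0 and 1:
print(dodge_positions([0, 1], n_groups=2, width=0.8))
# approximately [[-0.2, 0.2], [0.8, 1.2]]
```

Stack and fill, by contrast, adjust y (cumulative sums, optionally normalized to 1) rather than x.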
Facets

The facet layer repeats the same plot design for each subset of the data: "y vs x for each level of variable z".

Coordinate systems

| Common Coordinate Systems in ggplot2 | |
| Function | Description |
|---|---|
| coord_cartesian() | Default Cartesian coordinates; standard x-y axes. |
| coord_flip() | Swaps x and y axes — useful for horizontal bar plots. |
| coord_fixed() | Ensures fixed aspect ratio between x and y units. |
| coord_polar() | Converts Cartesian to polar coordinates (e.g., pie charts). |
| coord_quickmap() | Approximates a Mercator projection — great for maps. |
| coord_trans() | Applies a mathematical transformation to axes (e.g., log scale). |
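Stripped to its math, coord_polar() is the polar-to-screen mapping; a stdlib-only sketch:

```python
import math

def to_polar_xy(theta, r):
    """Map a polar point (angle in radians, radius) to screen x/y,
    which is what a polar coordinate system does when drawing."""
    return (r * math.cos(theta), r * math.sin(theta))

# A stacked bar spanning [0, 2*pi) in angle becomes a full turn:
# this is how coord_polar() turns a stacked bar chart into a pie chart.
x, y = to_polar_xy(math.pi / 2, 1.0)  # quarter turn, radius 1
print(round(x, 9), round(y, 9))  # 0.0 1.0
```

Every geom drawn under coord_polar() passes through this mapping, which is why straight segments become arcs.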
Create facets, violin plots, and TikZ figures using: